[Feature] Enable inference support for Deepseekr1-w8a8-MTP #1994

Irving11-BKN · 2025-07-24T10:01:53Z

Support inference for the Deepseekr1-w8a8-mtp model, whose MTP layers use a statically quantized `shared_head`.
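
For readers unfamiliar with the terms: W8A8 means 8-bit weights and 8-bit activations, and "static" quantization means the activation scale is fixed ahead of time rather than computed from each input at runtime. The following is a minimal, generic sketch of that idea in plain Python. It is not the vllm-ascend implementation; the function names and scale values are hypothetical and chosen only for illustration.

```python
from typing import List

INT8_MIN, INT8_MAX = -128, 127


def quantize_static(values: List[float], scale: float) -> List[int]:
    """Quantize floats to int8 with a fixed, precomputed scale (static quantization)."""
    return [max(INT8_MIN, min(INT8_MAX, round(v / scale))) for v in values]


def int8_linear(x_q: List[int], w_q: List[List[int]],
                x_scale: float, w_scale: float) -> List[float]:
    """Integer matmul of one activation vector against weight rows, then dequantize."""
    return [sum(xi * wi for xi, wi in zip(x_q, row)) * x_scale * w_scale
            for row in w_q]


# Hypothetical scales, as if produced by an offline calibration pass.
ACT_SCALE = 1 / 64
WEIGHT_SCALE = 1 / 32

x = [0.5, -1.0, 0.25]          # activations
w = [[1.0, 0.5, -0.5]]         # one output row of weights

x_q = quantize_static(x, ACT_SCALE)
w_q = [quantize_static(row, WEIGHT_SCALE) for row in w]
y = int8_linear(x_q, w_q, ACT_SCALE, WEIGHT_SCALE)
```

Because the scales are constants, the quantization step needs no per-input statistics. Values outside the representable range are clipped to the int8 limits.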

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>

vLLM version: v0.9.2
vLLM main: vllm-project/vllm@6eca337


codecov · 2025-07-24T10:16:28Z

Codecov Report

❌ Patch coverage is 37.50000% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 71.43%. Comparing base (ff97740) to head (ad4fd7c).
⚠️ Report is 638 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| vllm_ascend/quantization/quant_config.py | 33.33% | 6 Missing ⚠️ |
| vllm_ascend/models/deepseek_mtp.py | 42.85% | 4 Missing ⚠️ |
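
The percentages in this report can be cross-checked with simple arithmetic. The sketch below is only a sanity check. The project totals are copied from the coverage diff in the report, while the per-file covered-line counts (3 and 3) are an assumption inferred from the reported percentages and missing-line counts, not figures stated by Codecov.

```python
def coverage_percent(hits: int, lines: int) -> float:
    """Coverage as a percentage of tracked lines."""
    return 100.0 * hits / lines


# Project totals (hits, lines) from the report's coverage diff.
base_cov = coverage_percent(6528, 9131)
head_cov = coverage_percent(6533, 9145)

# Patch figures per file: (covered, missing).
# The covered counts are inferred, see the note above.
patch = {
    "vllm_ascend/quantization/quant_config.py": (3, 6),
    "vllm_ascend/models/deepseek_mtp.py": (3, 4),
}
patch_hits = sum(covered for covered, _ in patch.values())
patch_lines = sum(covered + missing for covered, missing in patch.values())
patch_cov = coverage_percent(patch_hits, patch_lines)

print(f"base {base_cov:.4f}%  head {head_cov:.4f}%  patch {patch_cov:.2f}%")
```

Under these assumptions the patch covers 6 of 16 changed lines, which matches the reported 37.50% with 10 lines missing.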

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1994      +/-   ##
==========================================
- Coverage   71.49%   71.43%   -0.06%     
==========================================
  Files          86       86              
  Lines        9131     9145      +14     
==========================================
+ Hits         6528     6533       +5     
- Misses       2603     2612       +9

| Flag | Coverage Δ | |
|---|---|---|
| unittests | `71.43% <37.50%> (-0.06%)` | ⬇️ |


…ect#1994)

Support the inference of the Deepseekr1-w8a8-mtp model with statically-quantized shared_head in MTP layers.

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@6eca337

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>

…ect#1994)

Support the inference of the Deepseekr1-w8a8-mtp model with statically-quantized shared_head in MTP layers.

- vLLM version: v0.9.2
- vLLM main: vllm-project/vllm@6eca337

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>
Signed-off-by: weijinqian_v1 <weijinqian@huawei.com>

### What this PR does / why we need it?

Fixes the failure to load `qwen3_moe` quantized weights caused by #1994.

### Does this PR introduce _any_ user-facing change?

None

### How was this patch tested?

Added a `qwen3_moe` W8A8 quantized model to `tests/e2e/multicard/test_qwen3_moe.py`.

- vLLM version: v0.10.0
- vLLM main: vllm-project/vllm@c494f96

---------

Signed-off-by: zhoux77899 <zhouxiang100@huawei.com>


github-actions bot added the module:quantization label Jul 24, 2025

[Feature] Enable inference support for Deepseekr1-w8a8-MTP

ad4fd7c

Signed-off-by: curryliu <120010041@link.cuhk.edu.cn>

wangxiyuan approved these changes Jul 29, 2025


jianzs merged commit ca8007f into vllm-project:main Jul 29, 2025
24 checks passed

This was referenced Aug 1, 2025

[main][Bugfix] Fix unable to load qwen3_moe quantized weights #2161

Closed

[main][Bugfix] Fix unable to load qwen3_moe quantized weights #2219

Merged
